Information Directed Sampling and Bandits with Heteroscedastic Noise

نویسندگان

Johannes Kirschner

Andreas Krause

چکیده

In the stochastic bandit problem, the goal is to maximize an unknown function via a sequence of noisy function evaluations. Typically, the observation noise is assumed to be independent of the evaluation point and satisfies a tail bound taken uniformly on the domain. In this work, we consider the setting of heteroscedastic noise, that is, we explicitly allow the noise distribution to depend on the evaluation point. We show that this leads to new trade-offs for information and regret, which are not taken into account by existing approaches like upper confidence bound algorithms (UCB) or Thompson Sampling. To address these shortcomings, we introduce a frequentist regret framework, that is similar to the Bayesian analysis of Russo and Van Roy (2014). We prove a new high-probability regret bound for general, possibly randomized policies, depending on a quantity we call the regret-information ratio. From this bound, we define a frequentist version of Information Directed Sampling (IDS) to minimize a surrogate of the regret-information ratio over all possible action sampling distributions. In order to construct the surrogate function, we generalize known concentration inequalities for least squares regression in separable Hilbert spaces to the case of heteroscedastic noise. This allows us to formulate several variants of IDS for linear and reproducing kernel Hilbert space response functions, yielding a family of novel algorithms for Bayesian optimization. We also provide frequentist regret bounds, which in the homoscedastic case are comparable to existing bounds for UCB, but can be much better when the noise is heteroscedastic. Finally, we empirically demonstrate in a linear setting, that some of our methods can outperform UCB and Thompson Sampling, even when the noise is homoscedastic.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Linear Multi-Resource Allocation with Semi-Bandit Feedback

We study an idealised sequential resource allocation problem. In each time step the learner chooses an allocation of several resource types between a number of tasks. Assigning more resources to a task increases the probability that it is completed. The problem is challenging because the alignment of the tasks to the resource types is unknown and the feedback is noisy. Our main contribution is ...

متن کامل

Information Directed Sampling for Stochastic Bandits with Graph Feedback

We consider stochastic multi-armed bandit problems with graph feedback, where the decision maker is allowed to observe the neighboring actions of the chosen action. We allow the graph structure to vary with time and consider both deterministic and Erdős-Rényi random graph models. For such a graph feedback model, we first present a novel analysis of Thompson sampling that leads to tighter perfor...

متن کامل

Sequential Matrix Completion

We propose a novel algorithm for sequential matrix completion in a recommender system setting, where the (i, j)th entry of the matrix corresponds to a user i’s rating of product j. The objective of the algorithm is to provide a sequential policy for user-product pair recommendation which will yield the highest possible ratings after a finite time horizon. The algorithm uses a Gamma process fact...

متن کامل

Regret Analysis of the Finite-Horizon Gittins Index Strategy for Multi-Armed Bandits

I prove near-optimal frequentist regret guarantees for the finite-horizon Gittins index strategy for multi-armed bandits with Gaussian noise and prior. Along the way I derive finite-time bounds on the Gittins index that are asymptotically exact and may be of independent interest. I also discuss computational issues and present experimental results suggesting that a particular version of the Git...

متن کامل

Estimating Quality in User-Guided Multi-Objective Bandits Optimization

Many real-world applications are characterized by a number of conflicting performance measures. As optimizing in a multi-objective setting leads to a set of non-dominated solutions, a preference function is required for selecting the solution with the appropriate trade-off between the objectives. This preference function is often unknown, especially when it comes from an expert human user. Howe...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره شماره

صفحات -

تاریخ انتشار 2018

Information Directed Sampling and Bandits with Heteroscedastic Noise

نویسندگان

چکیده

منابع مشابه

Linear Multi-Resource Allocation with Semi-Bandit Feedback

Information Directed Sampling for Stochastic Bandits with Graph Feedback

Sequential Matrix Completion

Regret Analysis of the Finite-Horizon Gittins Index Strategy for Multi-Armed Bandits

Estimating Quality in User-Guided Multi-Objective Bandits Optimization

عنوان ژورنال:

اشتراک گذاری